Impact on Performance of Hypertext Classification of Selective Rich HTML Capture

Authors

  • Houda Benbrahim
  • Max Bramer
Abstract

Hypertext categorization is the automatic classification of web documents into predefined classes. It poses new challenges for automatic categorization because of the rich information in a hypertext document. Hyperlinks, HTML tags, and metadata all provide rich information for hypertext categorization that is not available in traditional text classification. This paper examines (i) which representation to use for documents and which extra information hidden in HTML pages to take into account to improve classification, and (ii) how to deal with the very high number of features in text. A hypertext dataset and three well-known learning algorithms (Naïve Bayes, K-Nearest Neighbour and C4.5) were used to exploit the enriched text representation along with feature reduction. The results showed that enhancing the basic text content with HTML page keywords, title and anchor links improved the accuracy of the classification algorithms.
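As a rough illustration of the enriched representation the abstract describes, the sketch below collects a page's body text and, optionally, its title, meta keywords and anchor text into one token list. It is not the authors' implementation: the class `RichTextExtractor`, the function `enriched_tokens` and the field names are hypothetical, the tokenizer is a simple assumption, and the feature reduction step is not covered.

```python
import re
from html.parser import HTMLParser


class RichTextExtractor(HTMLParser):
    """Hypothetical helper: gathers body text, title, meta keywords, anchor text."""

    TRACKED = ("title", "a", "script", "style")

    def __init__(self):
        super().__init__()
        self.parts = {"body": [], "title": [], "keywords": [], "anchor": []}
        self._open = {tag: 0 for tag in self.TRACKED}  # nesting depth per tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            # <meta name="keywords" content="..."> carries the page keywords
            self.parts["keywords"].append(attrs.get("content") or "")
        elif tag in self._open:
            self._open[tag] += 1

    def handle_endtag(self, tag):
        if tag in self._open and self._open[tag] > 0:
            self._open[tag] -= 1

    def handle_data(self, data):
        if self._open["script"] or self._open["style"]:
            return  # ignore code and stylesheet content
        if self._open["title"]:
            self.parts["title"].append(data)
            return
        self.parts["body"].append(data)
        if self._open["a"]:
            # anchor text is kept as body text and also as its own field
            self.parts["anchor"].append(data)


def enriched_tokens(html, fields=("body", "title", "keywords", "anchor")):
    """Return lowercase tokens from the selected fields, in field order."""
    parser = RichTextExtractor()
    parser.feed(html)
    parser.close()
    tokens = []
    for field in fields:
        text = " ".join(parser.parts[field]).lower()
        tokens.extend(re.findall(r"[a-z0-9]+", text))  # assumed tokenizer
    return tokens
```

Calling `enriched_tokens(page, fields=("body",))` gives a plain-text baseline, while the default call appends the title, keyword and anchor tokens, so the two representations can be fed to the same classifier and compared.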


Similar resources

Neighbourhood Exploitation in Hypertext Categorization

As the web expands exponentially, the need to bring some order to its content becomes apparent. Hypertext categorization, that is, the automatic classification of web documents into predefined classes, emerged to relieve humans of that task. The extra information available in a hypertext document poses new challenges for automatic categorization. HTML tags and linked neighbourhood all provide rich ...

Presenting HTML Structure in Audio: User Satisfaction with Audio Hypertext

1.0 Audio on the WWW: Every day, more information is being made available online in the form of electronic documents. Since the advent of the World Wide Web (WWW), hypertext (in particular, HTML) has become the medium of choice for the presentation of these documents. This is because HTML allows for the design of rich document structure, including tables, images, and hyperlinks, via a relatively...

Enhanced Information Retrieval by Using HTML Tags

Whenever digital libraries or knowledge management systems are to be automatically filled with web pages from the internet, document classification of the web pages is one of the major challenges. We present an approach which uses HTML tags in order to improve the quality of the hypertext document classification. Our approach uses weighting of HTML tags for separating relevant information in hy...

Building Adaptive Rich Interfaces for Interactive Ubiquitous Applications

The emergence of Web 2.0 (O’Reilly, 2005) has allowed users more interactivity with Web applications. Among the striking features of Web 2.0 applications, the use of rich interfaces that afford users a more meaningful experience with these applications stands out. In this context, the so-called Rich Internet Applications (RIAs) have transposed the boundaries of simple interfaces built only i...

A Study of Approaches to

Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and metadata extracted from related web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an op...

Publication date: 2004